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Abstract 

A pair of variables that tend to rise and fall either together or in opposition 
are said to be monotonically associated. For certain phenomena, this tendency 
is causally restricted to a subpopulation, as, for example, an allergic reaction 
to an irritant. Previously, Yu et al. (2011) devised a method of rearranging 
observations to test paired data to see if such an association might be present in 
a subpopulation. However, the computational intensity of the method limited 
its application to relatively small samples of data, and the test itself only judges 
if association is present in some subpopulation; it does not clearly identify the 
subsample that came from this subpopulation, especially when the whole sam¬ 
ple tests positive. The present paper adds a “top-AT” feature (Sampath and 
Verducci (2013)) based on a multistage ranking model, that identifies a concise 
subsample that is likely to contain a high proportion of observations from the 
subpopulation in which the association is supported. Computational improve¬ 
ments incorporated into this top-AT tau-path (TKTP) algorithm now allow the 
method to be extended to thousands of pairs of variables measured on sample 
sizes in the thousands. A description of the new algorithm along with measures 
of computational complexity and practical efficiency help to gauge its poten¬ 
tial use in different settings. Simulation studies catalog its accuracy in various 
settings, and an example from finance illustrates its step-by-step use. 

Keywords: Kendall’s tau, nonparametric correlation, ranking, copula, 
mixtures of distributions, unsupervised classification, computational 
complexity 


1. Introduction 

The emergence of “big data” has provided the opportunity for scientists to 
focus on key subpopulations. These are often identified by having higher mean 
values on some relevant aspect that is measurable by variables in the database. 
Here we change the criterion to having higher correlational values. 


Corresponding author. 

Email address: sainpath.5@osu.edu (Srinath Sampath) 





There are many potential applications. In cancer research, the mechanisms 
whereby some cancers become chemoresistant are inherent in their gene net¬ 
works, not all of which have been identified. One way to discover novel gene 
networks is to look for correlations between pairs of genes across many different 
types of cancer cells. The subpopulation of cancers that possess this network 
will match the subpopulation that supports the correlations. In marketing, a 
goal is to identify subpopulations that will be most responsive to a particular 
campaign. In finance, association between a company’s earnings and a key com¬ 
modity price may be restricted to certain macroeconomic conditions which are 
not known or directly measured. In all these cases, and more, the basic prob¬ 
lem is to identify, as concisely as possible, a highly correlated subsample that 
includes most of the members of the subpopulation that supports association 
between the variables. This is thus a kind of unsupervised classification, with a 
novel criterion. 

Previous work on this problem is relatively recent. Yu et al. (2011) developed 
the tau-path algorithm for re-sorting a sample so that the Kendall tau coefficient 
is (optimally) decreasing. In checking the behavior of the tail of this tau-path, 
they devised a test for correlation in a subsample versus the null hypothesis of 
independence. A limitation of operating under this null hypothesis is that the 
whole population tends to test positive when the supporting subpopulation is 
large or when the association is strong, which is not uncommon in datasets with 
large sample sizes. Bamattre et al. (2015) discussed some procedures for the 
much more difficult problem of testing for heterogeneity of association in which 
the distribution under the null hypothesis may have many different forms. A 
more modestly-scoped approach, suggested in Sampath and Verducci (2013), 
assumes a positive test for overall correlation, and attempts to discover if this 
correlation really is supported only within a proper subpopulation. They applied 
a multistage model of agreement between rankings to the re-sorted sample to 
indicate the point where predictive information stops. This method, called 
the top-AT tau-path (TKTP), is intended to provide good coverage of the true 
underlying supporting population, but this has not been investigated in detail 
until now, mainly because the original implementation was not scalable to large 
datasets. 

The current paper covers two issues not previously resolved. The first is 
extending the tau-path method to large samples. The original code was practical 
only up to sample sizes n < 100; the new code easily handles n up to 10,000 
on a personal computer. This extension to large samples makes it possible to 
address the second issue: how to characterize the performance of TKTP beyond 
the simple task of testing against independence. Simulation studies demonstrate 
the pattern between size of selected subsample and percent of the true samples 
included. 

Section 2 provides a thorough description of TKTP. Section 3 details the 
design of the new algorithms, including their computational complexity and 
runtime efficiency. Performance characteristics of their coverage proportions 
appear in Section 4, and a new example of predicting trends in stock prices in 
Section 5. The last section of the paper summarizes all findings and discusses 
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mathematical underpinnings. 


2. The Top-i^ Tau-Path (TKTP) Screen 

This section covers Kendall’s t measure of association, its corresponding 
tau-path, and multistage ranking models to show how these fit together to form 
the TKTP method of screening for a subpopulation. 


2.1. Concordance and Kendall’s r 

A standard approach to quantifying the strength of the relationship between 
two variables is to enumerate the number of concordances and discordances 
occurring in sample pairs, and then regard the difference between these counts as 
a measure of association. More precisely, if X and Y are random variables from 
a joint distribution F, with independent observations {Xi, Yi) and {X 2 , Y 2 ), the 
probabilities of concordance and discordance are, respectively, 

p, = P[{Xi-X2){Yi-Y2)>0] 

Pd = P[{X,-X2)iY,-Y2) <0], 
and the popular Kendall’s t measure of association is simply 


T =Pc- Pd- 


Note that r depends only on the ordinal properties of X and K, so that in a 
sample the data vectors X and Y may be replaced by their ranks with no loss 
of information about r. 

An unbiased estimator of r, the Kendall’s tau coefficient, can be computed 
from the random sample {{Xi,Yi), {X 2 , Y 2 ),..., (Xn, Y„)} by way of the n x n 
concordance matrix c = ((cj^j])), a tool that captures association between every 
sample pair in binary form as 


C[*j] = 


1, if the (f,j)th pair is concordant 
— 1, if the (i,j)th pair is discordant. 


and from which Kendall’s tau coefficient, computed as 


T = 



is an unbiased estimator of Kendall’s r. If A and D are the number of concordant 
and discordant sample pairs, respectively, then 


T = 


A-D 
A + D' 


With this theoretical framework. Subsection 2.2 delves into the idea of seri- 
ation, a common matrix reordering technique, and discusses the approach taken 
by Yu et al. (2011) to permute the concordance matrix and thus establish the se¬ 
quentially maximal monotone decreasing tau-path to uncover pockets of strong 
association in overall moderately associated sample data. 
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2.2. Seriation and the Tau-Path 

A simple yet powerful clustering technique to discover hidden structure and 
underlying patterns in data involves the permutation of multidimensional data 
along a single dimension. In archeology, the chronological ordering of similar 
relics could yield new insight into the different eras based on the similarity of 
adjacent objects. Known as seriation, this technique finds application in a wide 
variety of fields of scientific inquiry. Liiv (2010) provides an in-depth history of 
this technique and its use in many disciplines. In unsupervised learning, seri¬ 
ation often manifests in the form of matrix reordering, where multidimensional 
matrices are permuted based on the magnitude of a particular characteristic, 
potentially leading to insights into underlying associations in the data. 

In the context of Kendall’s r, Yu et al. (2011) permuted the sample concor¬ 
dance matrix in a very specific manner described in the next section. By means 
of either of two backward conditional search (BCS) algorithms, the Fast Back¬ 
ward Conditional Search (FastBCS) and the Full Backward Conditional Search 
(FullBCS) algorithms, they selected increasing subsets of the sample matrix in 
such a way that the associated tau coefficients become monotonically decreas¬ 
ing; the tau coefficient corresponding to the 2x2 subset is therefore at least 
as large as the one corresponding to the 3x3 subset, and so on. The resulting 
ordering of the sample matrix is referred to as the tau-path. 

Using the tau-path statistic, Yu et al. (2011) also propose a test for inde¬ 
pendence between the two variables versus the alternative that the variables 
are correlated only in a subpopulation. By simulating a large number of inde¬ 
pendent samples of size n from independent X and Y populations, establishing 
the tau-path for each sample, and then calculating the percentiles of Kendall’s 
tau coefficient at each step along the tau-path, and picking a fixed percentile 
to serve as the boundary at each step; for any fixed percentile q, the boundary 
points {qi\ i = 2,... ,n} are calibrated so that no more than a proportion a of 
the random tau-paths exceed the boundaries. Note that the same simulation 
may be achieved by sampling pairs of independent permutations of {1,..., n}. 

The tau-path approach thus provides an ordering of the sample bivariate 
data along the path of strongest association. So far the focus of this subsection 
has been on uncovering positive associations between the two variables; should 
a negative association between the variables be sought, this is readily accom¬ 
plished by multiplying the Y’s by —1 before performing the tau-path analysis. 

Another aspect of the ordered data that is worth computing is the stopping 
stage or endpoint of association] the stage in the tau-path ordered data where 
association ends and randomness sets in. Especially in these times when large 
datasets are readily available, and speed of computation and immediacy of re¬ 
sults are important considerations, the need to prune the full dataset down to 
the appropriately-sized subset for further analysis is critical. The key notion 
here is that as one progresses down the tau-path from most- to least-associated 
sample pairs, the two data elements in any pair should provide largely consis¬ 
tent information about their strength of association. When adjacent data pairs 
routinely provide inconsistent information about their strength of association, 
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it is possible that the level of association has deteriorated and randomness has 
taken over. In the context of the tau-path ordering, multistage ranking models 
provide an appropriate framework for extracting the endpoint of agreement. 

2.3. The Multistage Ranking Model and the TKTP Stopping Rule 

The primary challenge with traditional models that assess the degree of as¬ 
sociation between two ranked lists is their mathematical complexity. Whereas a 
multinomial distribution provides a full characterization of the population, the 
large number of ensuing parameters leads to intractable likelihood equations, 
even when the number of objects being ranked is small. Fligner and Verducci 
(1988) proposed a family of forward multistage ranking models that vastly re¬ 
duce the computational effort and also provide clear insights into the underlying 
processes. The multistage ranking framework is outlined here and the TKTP 
stopping rule then developed. 

Given a group of n objects, the relative value of the objects to an assessor— 
whether qualitative or numerical—can be expressed by the assignment of ranks 
to the objects. Two representations of the assessor’s preferences are popular; 
the first is the ranking or permutation, a vector of length n that gives, for each 
object on the original list, the assessor’s respective rank, namely the quantity 
(1 -I- the number of other objects that the assessor considers superior). This 
representation of the ranked preferences is denoted 

TT = [7r(l),...,7r(n)]. 

The second representation, an ordering or inverse permutation of the n ob¬ 
jects, is also a vector of length n, showing the labels of the n objects set in 
ranked order. Thus the ordering or inverse permutation corresponding to the 
representation tt above is 

7r“^(j) = i if 7r(i) = j, i = 1,... ,n, j = 1,... ,n. 

Our interest lies in measuring the extent of agreement between the prefer¬ 
ences of two assessors who independently rank the n objects, and in determining 
the last stage where agreement still exists and random noise follows, the stop¬ 
ping stage or endpoint of agreement between the two assessors. This is achieved 
by anchoring one assessor’s ranking of the objects as the reference ranking or 
ground truth, and then examining the stage-wise departures of the second asses¬ 
sor’s ranks from the corresponding reference ranks; the latter is the generated 
or observed ranking tt. Following the approach given by Fligner and Verducci 
(1988), the computations for the first stage and a generic stage j are given below. 
The stages are assumed to be independent, and penalties and truncated geo¬ 
metric probabilities are assigned to the second assessor’s ranks commensurate 
with their stage-wise deviations from the first assessor’s ranks as follows: 

Stage 1: Since all n objects are available, the second assessor selects the 
(1 -|-r’)th best object overall, as specified by 7r“^, and incurs the penalty Vi = v 
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with probability 


P(Fi = v) 



V = 0,... ,n — 1, 0 < ri < 1. 


Stage j (j = 2,..., n — 1): For each subsequent stage j, n — j + 1 objects are 
still available. Here the second assessor selects the (l+u)th best available object, 
as specified by and incurs a penalty Vj = v with probability 



The vector {Hi,, H„_i} thus captures the stage-wise deviations between 
the two sets of ranks, and is referred to as the discordance or penalty vector 
between the ranking schemes, {rj} too represents the stage-wise disagreement 
between the two assessors. Rather than focus on r^, the transformation 
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logrj, j = l,...,n- 1 


0 


is used, and a higher estimated 6j reveals a closer association between the as¬ 
sessors’ ranks for object j. 

It remains to determine the stopping stage or endpoint K, the last stage 
in the list of objects where the second assessor’s rank still shows some form 
of agreement with that of the first. Beyond this stage, the second assessor 
posts random stage-wise ranks relative to the first, and the previous association 
between the assessors’ ranks has faded. Subsection 3.1 of Sampath and Verducci 
(2013) provides the moving average maximum likelihood estimator (MAMLE) 
for the rj, calculated iteratively over overlapping backward-looking windows of 
stages i, j — w + l<i<joi fixed width w, till all the stages are exhausted. 
The resulting curve {rj} and its inverted counterpart {9j} are locally smooth 
estimators of the stage-wise agreement between the two assessors. 

The rejection region for the MAMLE, which enables the stopping rule, is now 
straightforward. Briefly, a large number of simulations are generated from the 
multistage model under the assumption that all the dj’s are 0, and stage-wise 
dj ’s are computed using the MAMLE method. For each stage j, the (1 — a)th 
quantile q{j) is computed. An estimate of the endpoint K is given by 

K = the earliest stage at which > q{K), and 9j > q{j) for at most 


a percent of the remaining j > K + w. (1) 


Additional diagnostics and guidelines on using the stopping rule are provided 
in Subsection 3.2 of Sampath and Verducci (2013). An alternative approach, 
based on a data-analytic method, is covered by Hall and Schimek (2012). 

The two techniques discussed above accomplish very distinct goals; the tau- 
path algorithm organizes bivariate data along the path from strongest to weakest 
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association, whereas the top-if MAMLE approach detects the endpoint of as¬ 
sociation between two ranked lists. Synthesis of the algorithms is accomplished 
by using the tau-path ordering from the first algorithm as the ordering of the 
stages of the second. 

Here the framework for the multistage model is that the pair (X, Y) of 
variables plays the role of a rater, ranking the observations {1,..., n} according 
to the tau-path ordering. The actual values of Kendall’s tau coefficient Tk over 
the first k stages along the tau-path are used to infer the strength of association 
at each stage. Choosing the stages according to the tau-path guarantees that 
the estimated 9j will be decreasing. This happens because the number of new 
discordances introduced at each tau-path stage must be non-decreasing. In 
this context, stopping rule (1) identihes observations in the remaining stages to 
have (X, Y) association no greater than noise. Since the ordered data is used as 
the input into the top-X algorithm, the computed endpoint gives the stopping 
stage where the strongly associated subsample ends. This subsample may be 
studied further to try to infer the underlying subpopulation. Section 5 provides 
an example where several pairs of variables are used to estimate a common 
subsample whose import may be identified by the behavior of an explanatory 
variable on this subsample. 

3. Algorithms, Computational Complexity and Efficiency of Imple¬ 
mentation 

The algorithms that underlie the TKTP method were developed to screen 
out observations for which a pair (X, X) of variables have association no greater 
than noise. The remaining observations are then most likely to represent a 
subpopulation that supports strong association. For large datasets this compu¬ 
tational task can be daunting. For example, in the analysis of microarrays used 
in molecular biology, gene chip platforms support increasing numbers of gene 
probes that can be attached to a microarray. In large-scale toxicity studies, hun¬ 
dreds of microarrays may be used to measure the effects of chemical compounds 
at various dosage levels over time. In a study measuring the expression levels 
of 30,000 genes responding to 3,600 chemical compound treatments, a dataset 
of 3,600 observations and 30,000 variables would be generated. In screening for 
relationships between pairs of genes, or 449,985,000 comparisons would 

be made. For each gene-gene pair identified, up to n = 3,600 observations could 
be combined into X]fe=2 (fc)from sizes 2 to n in the search for associated 
subsets. 

For large datasets, it is infeasible to examine computationally all pairs of 
variables and the population subsets within each pair. Critical to the design 
of the top-X and tau-path algorithms was an understanding of their order of 
growth. The variety of software languages and hardware architectures we con¬ 
sidered required care not to bias the specification of the algorithms toward 
a particular implementation context since programming languages, compilers, 
processor architectures, and memory hierarchies all affect the performance of 
any implementation. 
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Figure 1: The TKTP algorithm. 


The TKTP algorithm is the top-level algorithm. A flowchart for this algo¬ 
rithm appears in [Figure 1]. In Subsection 3.1 we provide an overview of the 
major steps of the TKTP algorithm and the functions it invokes. Since FastBCS 
and its optimized successor FastBCS2 are critical to the runtime performance of 
TKTP, these algorithms are described in detail in Subsection 3.1.2 along with 
a simple example showing how they work. In Subsection 3.2 we analyze the 
computational complexity of the FastBCS* algorithms, and in Subsection 3.3 
describe their efficiency and the various strategies we explored for implement¬ 
ing them. The pseudocode used to describe the algorithms mostly follows the 
conventions specified in Cormen et al. (2009). 

3.1. Overview of Algorithms 

3.1.1. The TKTP Wrapper 

The input to the TKTP algorithm shown in [Algorithm 1] is a random 
sample of observations for variables X and Y from a bivariate distribution. It 










































invokes four major functions to find a subset of strongly associated observations 
within this pair: FastBCS, TaupathMAMLE, GenerateRejectBoundary, and 
StoppingPoint: 


Algorithm 1 TKTP(A, Y) 

Input: A random sample S of n pairs {Xi.Yi), ..., (Xn,Yn) from a con¬ 
tinuous bivariate population. 

Output: A possibly empty subset s of observations in S in which X^ and 
Yg are strongly associated. 

Require: lenqth(X) = lenqth(Y) 

1: WINDOW ^5 
SIGLVL ^ 0.05 
NSIM ^ 10000 

// Find the permuted order of the observation pairs and the tau-path for 
A and Y using FastBCS*. 

2: pi,taupath •(— FastBCS2{X^Y) 

II Get moving average maximum likelihood estimators (MAMLEs) for X 
and Y. 

3: 0 <— TaupathMAMLE{taupath,WINDOW) 

II Simulate the MAMLE boundary at the (1 — a)th quantile of the null 
hypothesis. 

4: boundary ^ GenerateRejectBoundary{n, WINDOW, NSIM, SIGLVL) 
II Estimate the stopping point K. 

5: AT •<— StoppingPoint{0, boundary, SIGLVL) 

6: return {pi[j] \ [j > K] A [Oj > 9(j)]}, where q{j) is the (1 — a)th quantile 
of 0j 


Line 2. The FastBCS algorithm shown in [Algorithm 2] orders the pairs 
from the strongest to least associated and generates the tau-path of Kendall’s 
tau coefficients that measure the strength of the association for each nested 
subset from size 2 to n. 

Line 3. The tau-path and window are passed as arguments to the Taupath¬ 
MAMLE algorithm shown in [Algorithm 3] to generate a MAMLE curve for 
the X and Y pair. The tau-path is used in the penalty function to calculate 
the MAMLE [4] curve. The window width determines the smoothness of the 
curve [6-8]. This algorithm is used to generate the reference ranking of the first 
assessor’s choices for the X and Y pair of variables, and to simulate the second 
assessor’s choices under the null hypothesis. 

Line 4. To determine the rejection region under the null hypothesis, the 
GenerateRejectBoundary function (not shown) simulates the second assessor’s 
choices under the null hypothesis by generating the MAMLE curves for a pair 
of random variables of size n for the specified number of iterations and invoking 
TaupathMAMLE on each pair. From these MAMLE curves, the stage-wise 
(1 — Q;)th quantiles are calculated and the boundary representing the edge of 
the rejection region is returned. 
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Algorithm 2 FastBCS(A, Y) 

Input: A random sample of n pairs (Ai, Yi),..., (A„, y„) from a continuous 
bivariate population. 

Output: The final permutation pi and the tau-path statistic derived from 
the concordance matrix constructed from pi. 

Require: n = length{X) = length(Y); n > 1 

// The function indexOf{a, v) returns the index of the value v in array a. 
1: C -ir- the concordance matrix for {Xi,Yi),..., (A„, 1^) 
pi ^ [1... n] 
i ^ n 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 
19 


tie[l.. .n] ^0 

repeat > backward conditional search 

for jA— 1 ... f do > backward elimination 

colsum[j\ ^ 

minsum t— minimum value in colsum 
tiesJ ^ {pi[l], I ^ 1 ■. ■ i\colsum[pi[l]] = minsum} 
if \ties- i\ > 1 then 
tie[i] ^ tiesJ 

r_tie •<— randomly select a member of tiesji 
I ^ indexOf{pi, rJie) 
transpose pi[i\ and pi[l] 
else 

I ^ indexOf {pi, ties J[l]) 
transpose pi[i\ and pi[l] 

for k ■ir- n downto (f + 1) do > tie logic 

if pi[i] € tie[k] then 

9* ^ [S«=l S«=l C'[pi[«].pi[i]]j • ■ • J J2u=l ^[piM.Pd*]]] 

9^ ^ E«=l C'[pi[M],pi[fc]]: C'[pi[ii],pi[/c]]j • ■ • ) X]u=l ^\pi[u\,pi[k]^ 

if All{qk[u] > qi[u])andAny{qk[u] > qi[u\),u = \...k — i + l 

then 


20 

21 

22 

23 

24 

25 

26 

27 

28 


transpose pi[i] and pi[k] 

i k — 1 \> forward step 

tie[m] -fr- [], Vm < k 

GOTO repeat 

i i — 1 > backward step 

until X]fc=i ^[piW.pdi]])) = (**(*” 1)) 

k ^ i 

T[2] i -^ T[k] ^ 1 

return The final permutation pi and the tau-path statistic {^[2]... T\n]} 
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Algorithm 3 TaupathMAMLE(tauPath, window) 

Input: 

- The tau-path of a random sample of n pairs (Ai, Yi),..., (A„, Y„) from a 
continuous bivariate population. 

- The size of the window. 

Output: An array of MAMLEs. 

Require: n = length{tauPath)] n > 1 
1: ma.theta •<— meanDif f •<— [ ] 

2: diffs ^ [0] 

3: for i •(— 2 ... n do 

4: totaldiscord[i — 1] ^ ((1 — tauPath[i])/2) * (*)) 

5: dijfs[i] <— (totaldiscord[i] — totaldiscord[i — 1]) 

6: for i ^ 0 ... {n — window — 1) do 
7: meanDif f ^ diffs[{i + 1)... (i + window)] 

8: ma.theta[i + window + 1] theta.scale{meanDif fs, i, window, n + 1) 

9: return ma.theta 


Line 5. The StoppingPoint algorithm shown in [Algorithm 4] estimates the 
stopping stage or endpoint K that indexes the end of agreement between the 
reference [XY tau-path) and generated (rejection boundary) rankings, by K. 


3.1.2. FastBCS and FastBCS2 

The Kendall’s tau coefficient of the concordance matrix (Yu et al. (2011); p. 
101) of a pair of variables is defined as 


T = 


A-D 

dlT 


( 2 ) 


where ( 2 ) is the number of distinct pairs of observations in the sample, and 
A and D are the number of concordant and discordant pairs of observations, 
respectively. The tau-path is a monotonically decreasing sequence of Kendall 
tau coefficients. The FastBCS algorithm (Yu et al. (2011); p. 103) generates the 
tau-path of a pair of statistical variables X and Y of a sample of observations. 
We have restated the algorithm in [Algorithm 2] and include a simple example 
to help explain it. All numbers in brackets below refer to line numbers in the 
FastBCS algorithm. 

The backward conditional search [2-25] consists of two main segments known 
as backward elimination [3-14] and tie logic [15-23]. The FastBCS incrementally 
constructs the permutation sequence pi backward from the set of all observations 
iSn at i = n to a subset of two observations S 2 at i = 2. The tau-path is 
constructed using the permuted order of pi after the main repeat loop exits. 
Each iteration i is called a stage. 

Prior to the search, the main data structures used during the operations are 
initialized [1]. The concordance matrix C of the set of observations for j = n 
is calculated using the natural ordering of the variables X and Y provided as 
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Algorithm 4 StoppingPoint(xyMainles, boundary, significanceLevel) 

Input: 

- The MAMLEs generated for the X and Y variables. 

- The rejection boundary under the null hypothesis. 

- The significance level of the rejection boundary. 

Output: The estimated value of the stopping point A or 0 if none was 
found. 

Require: n = length{xyMamies) = length(boundary)] n > 1 
1: exceed ■<— [] 

j ^ 1 

2: for i in 1... n do 

3: if mamles[i] > boundary[i] then 

4: exceed[j] e- i 

5: J ^J + 1 

6: numExceed •<— length{exceed) 

7: sort{exceed) 

8: for i in 1... length{exceed) do 
9: tail ^ n — exceed[i] 

10: left •<— numExceed — i 

11: if left < significanceLevel * tail then 

12: return candidate[i] 

13: return 0 


input to the algorithm. The values of the permuted index pi used to track 
the reordering of the observations at each stage are initialized to the natural 
ordering of X and Y. At each stage i, an i x i permuted matrix will be derived 
from C using pi. This is shown in the figures below as C^. The row and column 
headings of the permuted concordance matrix are the values indexed by pi and 
refer to the observation IDs shown in the headings of C. The tie array maintains 
a tieset for each stage. Each element of the tie array is set to empty tiesets. 

The search begins with backward elimination. The subset Si of observations 
pi[l.. .i] for any stage i is fixed. The goal of backward elimination is to find 
and eliminate the observation in the subset pi[\ .. .i\ that contributes least to 
Kendall’s tau coefficient—or increases most the value of D in Equation (2)— of 
the remaining subset of observations Si-i. Elimination is done by transposing 
this observation with the observation at pi[i\. The result of the transposition 
guarantees that Tj_i > Tj in this stage. If the observation represented by 
the ith stage, is not a member of any prior stages’ tiesets, a backward step is 
taken by setting i to i — 1 [24]. These steps are repeated until either Kendall’s 
tau coefficient for stage i becomes 1, or i = 2 [25]. 

A tie occurs at stage i if more than one observation could be eliminated 
from Si [7]. The tied observations form a tieset that is associated with stage 
i [8]. The observations in tiesets may be reexamined in the tie logic of later 
stages [16]. To complete the backward elimination at stage i in the presence of 
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ties, one observation is arbitrarily chosen and eliminated from Si [9]. Because 
the tau-path is constructed in reverse order, the algorithm cannot determine 
the effect of future rearrangements on local maximal monotonicity at the time 
a selection from the tieset is made. In subsequent stages k < i, the tie logic 
will reexamine the choice made in stage i [17-18] and take a forward step [21], 
if necessary, to ensure monotonicity in the tau-path. To convey an intuitive 
understanding for how the algorithm works, we walk through a simple example. 


Initialization 

C 

1 

2 

3 

4 

5 

C concordance matrix of 

1 

0 

-1 

-1 

1 

-1 

X, r 

2 

-1 

0 

-1 

1 

-1 

pi^ [ 1 , 2 , 3 , 4 ,5] 

3 

-1 

-1 

0 

-1 

1 

i ^ 5 

4 

1 

1 

-1 

0 

-1 

tie[l ... 5] •(— 0 

5 

-1 

-1 

1 

-1 

0 


Figure 2: Example: the initialization of the concordance matrix C calculated for a pair of 
variables X and Y with five observations. 


Example. Suppose we wish to generate a tau-path of the variables X = 
[1,2,4, 3, 5] and Y = [4, 3,1,5,2] of a sample of five observations. FastBCS is 
invoked with X and Y as arguments. The initialization [1] is shown in [Figure 
2]. It initializes the concordance matrix C based on the natural ordering of the 
pairs of observations in X and Y. The array pi is set to the natural ordering 
1, ..., 5. The current stage i is set to 5. Finally, each element of array tie 
which preserves the ties found at a particular stage i is initialized to the empty 
set. [Algorithm 2] begins the backward conditional search [2] and the first two 
stages, stages 5 and 4, are shown in [Figure 3]. 

All five observations from the sample are in subset S^. To find the ob¬ 
servation to eliminate from backward elimination begins by calculating the 
column sums of each of the observations from 1 to 5 of the permuted concordance 
matrix [3-4]. Because no transpositions have taken place, C 5 is identical to 
C. Four of the columns of C 5 sum to values of -2, so we generate the tieset 
of observations {1,2,3, 5} [ 6 ] and save it in tie[5] [ 8 ]. Although any of these 
four observations could have been randomly chosen for elimination, throughout 
this example we will always choose the first observation of the tieset. In stage 
5, by transposing pi[l] with pi[5], the observation that contributes least to the 
Kendall’s tau coefficient of S' 4 , pi is reordered so that the observations from 
pi[l] to pi [4] will generate T 4 > T 5 . No operations are performed in the tie logic 
since there is no fc > i. We set i to 4 [24] and proceed in similar fashion at 
stage 4 with pi = [5, 2,3,4,1]. Our interest in stage 4 is only in the first four 
elements of pi which constitute the subset and the concordance matrix C^. 
Four columns are summed and the tieset {5, 2,3,4} is saved in tie[4]. Since item 
4 is not a member of any tieset of fc > i, there are no prior tiesets to reexamine, 
so items 4 and 5 are transposed and i is set to 3. In both stages 4 and 5, only 
backward steps have been taken. 
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Stage 5 

Backward Elimination 

Calculate the column sums. 
Create a list of ties: 
tiesJ ^ [1, 2, 3,5]. 
Transpose pi [5] and pi[l]. 

Tie Logic 

There are no tiesets to ex¬ 
amine in stage 5. 

backward step: i 4 

Stage 4 

Backward Elimination 

Calculate the column sums. 
Create a list of ties: 
ties J ^ [5, 2, 3,4]. 
Transpose pi [4] and pi[l]. 

Tie Logic 

pi [4] = 4 is not a member of 
any tieset for /c > 4. 

backward step: i 3 
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Figure 3: Example: the main steps taken in stage 5 resulting in a backward step to stage 4. 


Stage 3 is shown in [Figure 4]. The steps in backward elimination find the sole 
observation 3 to contribute the least. No transposition is required. To ensure 
that pi [3] = 3 locally optimizes the tau-path, the tie logic [15-23] reexamines 
all tiesets from k down to i -I- 1 in which 3 is a member. Since observation 3 is 
a member of the tieset from stage 5, the processing continues by attempting to 
determine whether the tau coefficient at stage 3 could have been improved had 
the observation at pi [A: = 5] been chosen instead. The calculations are shown in 
the tie logic of [Figure 4]. In effect, the algorithm compares the cumulative sums 
of stage 3 of two different concordance matrices shown as C 3 and C 3 in [Figure 
4(a) and 4(b)], respectivefy. C 3 represents the permuted order from the choice 
that was made in stage 5. Cg represents the concordance matrix that would 
have resulted from choosing the observation pi [A: = 5], and is constructed by a 
transposition of the observations at pi [3] and pi [5]. The cumulative sum qi is 
calculated from C3 and qk from Cg [17-18]. Two tests must pass [19] if is to 
be considered the better choice: All of qk^^^^ > qi^^^ and Any qk^^^ > qiu= 3 - 
Both tests succeed and the algorithm continues with Cg as shown in [Figure 
4(c)]. 


14 















Stage 3 


Backward Elimination 

Calculate column sums. 
No ties to save. 

Tie Logic 

pi [3] = 3 is a member of 
stage 5 tieset. 

Determine locally optimal 
observation. 


(a) Calculate the cumulative sum qi of 
pi[i = 3] 


(b) Calculate the cumulative sum qk of 
pi[k = 3] 


(c) All{qk > qi) AND Any{qk > qi) 
test passes. 
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(d) FastBCS continues with Cg 
Clear the tiesets: tie[1..5] = 0 
Forward step: i ^ 4 

Figure 4: Example: the main steps taken in stage 3 that result in a forward step to stage 4. 
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The final stages, as shown in [Figure 5], follow a series of backward steps 
from stage 4 to stage 2 where the search ends [25]. The algorithm outputs pi 
and the tau-path statistic calculated from the concordance matrix generated 
from pi [28]. 


Stage 4f 

Backward Elimination 

There are no ties to save. 

Tie Logic 

pi [4] = 4 is not a member of 
any tieset for fc > 4. 

Stage 3f 

Backward Elimination 

Transpose pi [3] and pi [2]. 

Tie Logic 

pi [3] = 1 is not a member of 
any tieset for k > 3. 

The until test succeeds. Halt, 
pi = [4,1,2, 5,3] 

tau-path = [1,1, 0.333, —0.333, —0.4] 
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Figure 5: Example: the second pass through stages 4f and 3f resulting from a forward step. 

FastBCS guarantees only that a sequentially maximal monotone decreasing 
path is generated (Yu et al. (2011), p. 102). It does not lead to a unique per¬ 
mutation nor to a unique tau-path since the choice of ties at earlier stages may 
limit subsequent maximizations. For example, for the input X = [1,2,3, 5,4] 
and Y = [2,4,1,3, 5] in [Figure 5], our implementation of the algorithm will 
not find the permutation pi = [1,2, 5, 3,4] with a tau-path of [1,1,1, 0.333, 0.2]. 
Instead, it finds the permutation pi = [3, 5,4,1,2] with a corresponding tau- 
path of [1,1,0.333,0.333,0.2]. To find the former tau-path requires a stronger 
algorithm such as the FullBCS (Yu et al. (2011), p.l03). 

The FastBCS2 algorithm shown in [Algorithm 5] is being introduced to re¬ 
duce the 0(n^) running time of FastBCS to 0{n'^) as evidenced by the perfor¬ 
mance times of FastBCS2 illustrated with [Table 5] in Subsection 3.3.4. It has 
the same inputs and outputs as well as overall structure, but some of the oper¬ 
ations have been moved, altered, or merged. The differences will be described 
in Subsection 3.2. 
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Algorithm 5 FastBCS2(A, Y) 

Input: A random sample of n pairs (Ai, Yi),..., (A„, y„) from a continuous 
bivariate population. 

Output: The final permutation pi and the tau-path statistic derived from 
the concordance matrix constructed from pi. 

Require: n = length{X) = length{Y); n > 1 

j / The function indexOf{a, v) returns the index of the value v in array a. 
1: C -ir- the concordance matrix for {Xi,Yi),..., (A„, Yn) 
pi^[l...n]; tie[l...n]^0 
i ^ n 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 


for jA— 1... n do 

ColsUUl[j^ i 

repeat 

ties-i ^ {pi[l], I ^ 1 ■. ■ i\colsum[pi[l]] = minsum} 
if \ties- i\ > 1 then 
tie[i] t— tiesJ 

rdie ^ randomly select a member of tiesJ, 

I ^ indexOf{pi, rJie) 

transpose pi[i] andpi[i]; transpose colsum[i] and colsum[l] 

else 

I ^ indexOf{pi,tiesJ,[l]) 

transpose pi[i] andpi[i]; transpose colsum[i] and colsum[l] 
for jA— 1... i do 

colsum[j] ^ colsum[j] - 

for k ^ n downto (i + 1) do 

if pi[i] € tie[k] then 

9* ^ [S«=i C*[pi[u],pi[i]], X]«=i C*[pi[u],pi[i]], • ■ •, X]«=i C'[pi[u],pi[i]]] 
9^ ^ E«=l ^[pi[u\,pi[k]]: X]ii=l C[pi[u\,pi[k]]^ • ■ • ; X]tt=l ^ [pi[u] ,pi[k]^ 

if All{qk[u] > qi[it\) and Any{qk[u] > qi[ii\),u = 1. .. A: — i + 1 

then 
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22 

23 

24 

25 

26 

27 

28 

29 

30 

31 


for jA— i... fc do 
for I <— 1 .. .j do 

colsum[l] ^ colsum[l] + C’[pi(j),pi(;)] 

transpose pi[i] and pi[A:]; transpose colsum[i] and colsum[l] 
i ^ k — 1 

tie[TO] ^ 0) ^ 

for jA— 1... i do 

colsum[j] ^ colsum[j] - C'[pi(i),pj(fc)], 

GOTO repeat 

i i — 1 

until Sfc=i ^[pi[fe].p*b1])) = (i * (* ~ 1)^ 

k i 

T[2] < -^ T[k] ^ 1 

return The final permutation pi and the tau-path statistic {r[2]... T[n]} 


32 

33 

34 
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3.2. Computational Complexity 

The analysis of an algorithm requires computational models of the target 
platforms that will be used for execution of the implementations. Our anal¬ 
ysis of FastBCS was based on a random-access machine (RAM) model where 
operations are assumed to be performed strictly in sequence in constant time, 
and a data parallel model which defines computation as a sequence of instruc¬ 
tions or kernel applied synchronously to sets of processing elements each with 
access to private data memory. The programming model for parallelism will 
be discussed in more detail in Subsection 3.3.3 in the context of performance 
data collected from multicore and manycore devices. No attempts were made 
to model memory architectures, although the hierarchies and distribution of 
memory play an important role in the runtime cost of implementations of the 
FastBCS* algorithms. 

3.2.1. Analysis of FastBCS 

Our analysis of FastBCS focused on the runtime as determined by the cost 
and frequency of critical operations and inner loops. To simplify the analysis, 
the FastBCS algorithm was rewritten as procedural pseudocode with a syntax 
resembling familiar programming languages such as C, Pascal or Java (Cormen 
et al. (2009), pp. 20-22). [Table 1] shows the frequency analysis and order of 
growth of critical statements of the repeat loop of the algorithm. The baseline 
implementation of FastBCS in Java was instrumented to generate experimental 
data for validating order of growth using the doubling ratio (Sedgewick and 
Wayne (2011), pp. 192-193) and as a basis for the probabilistic analysis of the 
average-case running time. Probes were placed in the FastBCS source code just 
before lines 3, 8, 13, 17, 20, and 24 of the algorithm. We profiled the execution 
of an implementation of FastBCS using 1,000 samples for each of n = 200 to 
2,000 by 200 to determine the frequency of events at the probes, and to collect 
statistics on the forward distance, the position of k when a forward step was 
taken, the tieset size, and the value of i when the halting condition was met. 
The empirical cost models derived from this profile are also shown in [Table 1]. 

We begin by examining the frequency of the outermost loop. Recall that 
the FastBCS algorithm divides into two major sections within the repeat loop 
[2-24]. Backward elimination [3-14] performs the search for the observation to 
eliminate from the subset at each stage i. The tie logic [15-23] looks forward, 
beginning at the end of the tau-path being constructed, to determine if another 
member of a tieset created in a previous stage k that includes the observation 
pi[i\ might improve the Kendall’s tau coefficient for concordance matrix Ci at 
stage i. 

The backward step [24] is taken unless a forward step [21] is reached. While 
each backward step results in the execution of only one stage, each forward 
step increases the number of stages to reconsider by fc — * — 1, a value we call 
the forward distanee. The frequency of the repeat loop is E[Xfi\ = E[Xb] + 
E[Xpd] — E[Xh], where E[Xb] is the expected number of backward steps, 
E[Xfb] is the expected forward distance and E[Xh] is the expected value of i 
when the algorithm terminates. 
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Line 

ytatement 

~ li^requency 

Big-Oh 

2 

repeat 



3 

tor j -1— 1... i do 
for u f— 1... t do 

ColsUTn\J^ i C\pi\u],pi\j]] 



4 



4a 

E[Xr\ * {E) 

0{n'^) 

5 

minsum f— min value in colsum 



6 

tor i ^ 1... i do 

if {colsum[{pi[i]] = minsum) 
add pi[i] to tiesJ 



6a 



6b 

E\XF]*Tl^Ak) 

O(n^) 

7 

if (ties-i > 1) 
tie[i\ f— ties-i 

r_tie f— randomly select an observation 

1 f— indexO f{pi,rJ.ie) 
transpose pi[i] and pi[l] 



8 



9 



10 

E[XR\/2*n/2 

O(n^) 

11 



12 

else 

1 f— indexO f {pi,ties J[l]) 
transpose pi[i] and pi[l] 



13 

E[XR\/2*n/2 

O(n^) 

14 



15 

for k n downto (i + 1) do 
if pi[i] member tie[k] 

qi f— cumsum{C,pi[i],pi[i],pi[k]), ... 
qk f— cumsum{C,pi[k],pi[i],pi[k]), ... 
if All{qk >= qi) and Any{qk >= qi) 
transpose pi[i] and pi[k] 
i f— A: — 1 
for 1... fc do 
tie[m] f— [] 

GOTO repeat 
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E[XR]*j::ZUk) 

O(n^) 

17 



18 



19 



20 



21 



22 



22a 



^3^ 



24 

i f— z — 1 



25 

until X] 7 =i X]fc=i = (**(* — 1)) 

E[Xr] * (i2) 

0{n^) 


Empirical Cost Models 


Expected value of: 


Xfi\ = —22.06 + 0.97 * n number of iterations in repeat loop 
Xt = 0.494 + 0.00001 * n number of ties found 

Xm] = —0.08 + 0.19 * n number of times observation is member of tieset 

Xfd] = 0.51 + 0.001 * n the total forward distance 

Xh] = 21.61 + 0.03 * n value of i when the algorithm halts 


Table 1: Frequency analysis of FastBCS. 


The number of backward steps is at most n — 1, since the repeat loop will 
surely terminate [25] when i = 1. Whether a forward step is taken depends on 
the number and size of tiesets created, on whether the observation at stage i 
is a member of a tieset generated in a previous stage [7, 16] and on whether a 
correction to the tau-path under construction needs to be made [19]. A forward 
step erases all tiesets up to stage k [22], and restarts the backward search at 
k-1 [23]. 

Since backward elimination is performed every iteration and summing the 
columns [3-4a] dominates the order of growth with O(n^), it was important to 
understand the contribution of backward and forward steps to the frequency of 
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n 

Nr (line 3) 
(avg) (max) 

Nr (line 8) 
(avg) 

Tieset size 
(avg) 

in 

(avg) 

Nr/Nr 

200 

177.44 

199 

86.92 

3.6 

22.30 

0.49 

400 

367.22 

394 

182.59 

3.8 

32.71 

0.50 

600 

559.76 

589 

280.69 

3.9 

40.74 

0.50 

800 

753.51 

794 

380.03 

4.1 

47.41 

0.50 

1000 

948.04 

1028 

480.20 

4.1 

53.38 

0.51 

1200 

1142.79 

1209 

580.73 

4.2 

58.89 

0.51 

1400 

1338.15 

1410 

680.78 

4.3 

63.66 

0.51 

1600 

1533.73 

1601 

782.30 

4.3 

68.15 

0.51 

1800 

1729.84 

1788 

883.69 

4.4 

72.81 

0.51 

2000 

1925.58 

2008 

986.21 

4.4 

76.75 

0.51 


Table 2: Statistics of the count of iterations of the repeat loop (Nn), the count of the times 
ties were found {Nx), the size of a tieset (Tieset size), the value of i when the algorithm halted 
{Xjj), and the proportion of times ties were found within the repeat loop {Nx/Nn). 


the outermost loop. Statistics for the frequency of the loop and tiesets are shown 
in [Table 2], The number of forward steps is negligible. Empirically it was at 
most 4 for n < 2,000 (Column Nps in [Table 4]). Since the average number 
of iterations (Column Np) is less than N, we assumed E[Xji] ~ n. Although 
tiesets are generated about half of the time (Column Np/Np), they are small 
in size (Column “Tieset size”), decreasing the likelihood that an observation 
associated with stage i will be a member of a previous tieset [17]. 


n 

Nr (line 3) 
(avg) 

iVM(line 17) 
(avg) 

Npsi 

(avg) 

ine 20) 
(max) 

N 

(avg) 

FD 

(max) 

Nm/Nr 

^DU 

177.44 

39772 

TQ3” 

3 

0.7 

21 

0.22 

400 

367.22 

77.37 

0.13 

2 

0.9 

26 

0.21 

600 

559.76 

117.30 

0.18 

2 

1.5 

31 

0.21 

800 

753.51 

157.26 

0.17 

3 

1.9 

41 

0.21 

1000 

948.04 

196.40 

0.20 

4 

2.4 

81 

0.21 

1200 

1142.79 

235.18 

0.21 

3 

2.7 

64 

0.21 

1400 

1338.15 

274.71 

0.22 

3 

2.8 

73 

0.21 

1600 

1533.73 

315.88 

0.20 

3 

2.9 

71 

0.21 

1800 

1729.84 

352.78 

0.24 

3 

3.7 

59 

0.20 

2000 

1925.58 

391.29 

0.18 

3 

3.3 

89 

0.20 


Table 3: Statistics of the count of the iterations of the repeat loop (A^/?), the number of 
times the observation of stage i was a member of a previously saved tieset (A^m), the number 
of forward steps (Nps), the total forward distance (A^fd)^ the proportion of times tie 
membership was found to the number of iterations of the repeat loop (Nm/^r)- 


In [Table 3] approximately 20% of observations at stage i are members of 
a previous tieset (Column Nm/Nr). The distribution of the reset position 
of k for those samples in which a forward step was reached is shown in the 
boxplot of [Figure 6]. The cost for repeating stages resulting from a higher k as 
compared with a lower k increases quadratically. Similarly, the distribution of 
the total forward distance for these samples is shown in [Figure 7]. The increase 
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in the number of iterations by forward steps is offset by early termination when 
permutations result in fully concordant matrices before i = 1. The distributions 
of i at termination for different n are shown in [Figure 8]. 
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Figure 6: The distribution of the average fcth reset position in a forward step. 
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Figure 7: The distribution of the total forward distance. 


To reduce the order of growth, we chose the operations that contributed 
most to the runtime cost and could be adapted to a parallel environment. The 
computation of column sums [3-4a] in FastBCS became the focus for FastBCS2, 
and the platform for exploring how to improve the runtime cost of TKTP in a 
variety of computing environments. The strategies of implementation will be 
described in Section 3.3, but we first provide a brief overview of the changes 
that were made to FastBCS. 
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Figure 8: The distribution of the expected value of i when the algorithm halts. 


3.2.2. Analysis of FastBCS2 

The FastBCS2 algorithm is shown in [Algorithm 5]. The central idea of 
FastBCS2 was to minimize the calculations required to produce the column 
sums from the permuted concordance matrix Ci at each stage i. The summing 
of all columns of the initial concordance matrix C is done once with a cost of 
0{n^) before the repeat loop is entered and the results are saved in the array 
colsum [2-3]. 

Once the loop is entered, the colsum values requiring changes are updated 
whenever a transposition of observations occurs through the permuted index 
pi [10, 13, 15, 24, 27-28] or a forward step is taken [21-23]. How the incre¬ 
mental adjustments upward or downward are made to the column sums [14-15, 
21-23, 27-28] can be seen in [Figure 9]. The naturally ordered concordance 
matrix C is created and the column sums in [Figure 9(a)] are initialized as 
{—2, —2, —2,0, —2}. After the observation 5 is transposed with observation 1 
[10], the values of the row labeled 1 are subtracted from the column sums shown 
in [Figure 9(a)] leaving the values of the column sums shown in [Figure 9(b)]. 
The adjustment [15] is made only 5 times. Only the operation that sums the 
columns in a forward step [21-23] can be 0{n^) in the worst-case but, as we 
have seen, forward steps are rarely taken. The others [14-15, 27-28] are 0{n^) 
for the RAM computational model. 

Refactoring and distributing the column sum operations has two advantages. 
First, it allows the postponement of operations until they are needed. Second, 
it provides an opportunity for an implementation to use source code fragments 
as specifications for parallelism for implementations using compilers that can 
translate them for data-parallel computing environments. In the example in 
[Figure 9], were n < 1,000 and the implementation running on an accelerated 
processing unit (APU) with 1,000 processing elements (PE), the for loop could 
theoretically be done in constant time by assigning each PEi the subtraction 
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(a) The initialization of colsums for C. 


(b) The colsums for C 5 after transpos¬ 
ing pi [5] and pi[l] in stage 5. 
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Figure 9: An example of how FastBCS2 distributes the column sum calculations, (a) The 
initial values for the column sums are calculated before the repeat loop [Line 4]. 25 operations 
are performed, (b) The values of the row pi[5] = 1 are subtracted from the colsums of C to 
produce the colsums of C 5 [Lines 14-15]. 5 operations are performed. 


operation and the ith elements of the column sum array of C 5 and the transposed 
row pi [5] = 1. In practice, it is not straightforward. The next section discusses 
why. 

3.3. Efficiency and Implementation Strategies 

Three implementations of the FastBCS and FastBCS2 algorithms were de¬ 
veloped. The two of FastBCS in R (R Core Team (2015)) and Java were used 
to explore and cross-validate the algorithm’s implementations. FastBCS2, the 
third implementation, was an optimization of FastBCS implemented in Java 
and parameterized to execute critical sections in either sequential or sequential- 
parallel mode. This subsection provides a brief history of the implementations 
of the algorithms, the opportunities for optimization provided by parallelism, 
our methods and results, and a discussion of the results. 

3.3.1. Early Implementations 

The first reference implementation of the FastBCS algorithm was done in the 
R programming language to explore the tau-path algorithm. Optimization was 
not a goal. However, the need to screen larger datasets required more efficient 
implementations. Although the R environment is mostly implemented in C, type 
safety checking imposes a performance penalty when data is passed between R 
and C. To avoid this, the second implementation of FastBCS was written entirely 
in C. Using variables of size n = 512, our benchmarks showed a nearly 30x 
improvement in runtime performance over the R implementation. However, we 
found the C implementation unable to handle large toxicity microarray datasets 
and difficult to profile. To address these limitations and to begin exploring the 
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parallelism made possible by the introduction of Java 8, we created a faithful 
reproduction in Java of the original R implementation which became the baseline 
for runtime performance analysis. 

3.3.2. The Promise of Parallelism 

There are four parallel architecture categories in Flynn’s taxonomy (Rauber 
and Riinger (2013), p. 11): Single-Instruction, Single-Data (SISD); Multiple- 
Instruction, Multiple-Data (MISD); Single-Instruction, Multiple-Data (SIMD); 
and Multiple-Instruction, Multiple-Data (MIMD). Of these, the SIMD (many- 
core) and MIMD (multicore) architectures were of interest. In SIMD architec¬ 
tures, all processing elements execute the same instruction synchronously at 
each step, but each processing element has access to private data memory. In 
MIMD architectures, each processing element has private access to program 
and data memory, and each element works independently. For applications 
with a high degree of data parallelism, the SIMD approach can be very efficient 
and—excluding the broader context of integrating with software applications— 
simpler to program than MIMD computers. This is because a single program 
flow controls execution groups of processing elements, eliminating the need for 
synchronization at the program level (Rauber and Riinger (2013), p. 11) Recent 
advances in hardware and software created opportunities to reduce the order 
of growth of TKTP by integrating algorithm design with implementations for 
data-parallel computational models. 

Graphics Processing Units (GPUs) now contain thousands of processing el¬ 
ements. Where GPUs were once devoted to rendering pipelines for realtime 
graphics applications, their value to other data-intensive applications is now 
recognized. Heterogeneous computing environments are emerging which com¬ 
bine aspects of MIMD and SIMD architectures with a unified memory model. 
A unified memory eliminates the need to transfer large blocks of data between 
the hierarchies of memory devoted to either GPU or GPU. 

While algorithms often cannot be easily parallelized, many contain islands 
of computation that would be amenable to parallel execution with the right 
kind of programming language and system software support. Three things are 
required. First is a parallel programming model that provides an abstraction of 
a heterogeneous computing environment to insulate the software developer from 
the low-level software and hardware differences of each device. Second, the pro¬ 
gramming language used for the implementation must allow sequential and par¬ 
allel fragments to be expressed uniformly and in a declarative form that allows 
late-binding decisions to be made by the virtual machine as to which available 
compute device would most efficiently run various elements of the computation. 
Third, because software engineering tools are critical for algorithm design and 
profiling of the implementation, standards are needed in order for these tools 
to be produced. Software technology and standards have advanced far enough 
to make it possible to explore implementing algorithms as sequential-parallel 
high-level programs as we have done with FastBGS2* implementations. 
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3.3.3. Methods and Contexts 

The OpenCL standard (OpenCL) computing model views a system as a 
collection of compute devices such as CPUs or GPUs, each of which may contain 
multiple processing elements. OpenCL 1.0, based on the C99 specification, was 
a language for describing program fragments called kernels that execute on 
compute devices. Although we considered C++/OpenCL 1.0 for specifying 
kernels, we chose Java. In Java version 8, Oracle introduced parallel streams, 
lambda expressions, and language extensions that provide a declarative style 
of programming for parallelism. An experimental Java library (Frost (2015)) 
was also being developed at AMD to provide support for GPU parallelization. 
OpenCL, Java 8, and Aparapi became the foundation for our initial work on 
sequential-parallel algorithms. 

The first attempt to parallelize FastBCS was based on the Aparapi library. 
The sequential-parallel implementation was specified in Java. At runtime in the 
Java virtual machine (JVM), the GPU-parallel fragments compiled into Java 
bytecode were converted by Aparapi into the instructions and dispatched to the 
computing system’s GPU device. For the software developer, this eliminated 
the need to write and debug the OpenCL code that managed the dispatching 
of work items into queues. Our initial attempts at speedup were disappointing. 
Because of updates to the critical data structures for each stage of FastBCS2, 
the cost of data transfers and Aparapi’s overhead were significant and offset 
the gains from GPU computation. We refocused our efforts on what could be 
achieved using the extensions to Java 8 in multicore systems. 

The gains from the redesign of FastBGS2 came by moving the initial calcu¬ 
lation of column sums outside the scope of the repeat loop and distributing only 
the updates across the algorithm. The FastBGS2 gains come from the redesign 
of the FastBGS algorithm, reducing the order of growth in the average case from 
0(n^) to 0{n^) as evidenced in [Table 5] of Subsection 3.3.4. This also made 
possible the use of the parallel streams framework in Java 8. The FastBGS2 
implementation is parameterized to execute either sequentially or with parallel 
fragments on multicore GPUs. In sequential-parallel mode, the parallel streams 
are executed. The creation and initialization of the concordance matrix, column 
summation, and column sum updates attempt to run on as many cores as are 
available. 

To separate the effects of algorithm design, programming language, com¬ 
putational platform, and parallelism on runtime performance, the data were 
collected in six contexts on three platforms as described in [Table 4] . A context 
consisted of an implementation of either the FastBGS or FastBCS2 algorithm 
(FB or FB2); the programming language used to author the implementation 
and the runtime environment or virtual machine in which it was executed (R 
or Java); the hardware and operating system (PI, P2, or P3); and an execution 
mode as either sequential (s) or sequential-parallel (sp). 

3 . 3 . 4 . Results 

Three performance metrics were generated for each context: runtime, speed¬ 
up, and the doubling ratio. The runtime measures the total amount of work 
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Software 

Hardware 

Context 

Algorithm 

Lang. 

SDK 

OS 

Mode 

Device 

Cores 

Mem. 

li'B-KPls 

"TastBCS” 

R 

RStudio 

OSX 

s 

MP-17 

4 

16GB 

FB-JP2S 

FastBCS 

Java 

Java 8 

OSX 

s 

MP-I7 

4 

16GB 

FB2-JP2S 

FastBCS2 

Java 

Java 8 

OSX 

s 

MP-17 

4 

16GB 

FB2-JP3S 

FastBCS2 

Java 

Java 8 

Linux 

s 

EC2-Xeon 

16 

32GB 

FB2-JP2sp 

FastBCS2 

Java 

Java 8 

OSX 

sp 

MP-17 

4 

16GB 

FB2-JP3sp 

FastBCS2 

Java 

Java 8 

Linux 

sp 

EC2-Xeon 

16 

32GB 


Table 4: The six execution contexts. Context: The contexts are encoded as algorithm (FB 
or FB2), programming language and virtual machine (R or Java), the computing platform 
(PI, P2, or P3), and the execution mode (”s” or ”sp”). Algorithm: The algorithm being 
implemented. Lang: The statistical language R, or Java. SDK: The Software Development 
Kit used to load, or compile and run the implementations was RStudio 0.99.441 for R and the 
Oracle Java Development Kit 1.8.0-b25-17, respectively. OS: The operating system was either 
OSX 10.11 or Linux Ubuntu 14.04.2.LTS. Mode: The execution mode indicates whether the 
implementation was run sequentially (s) or as sequential-parallel (sp). Device: The Apple 
computer was a Macbook Pro with a 2.2 Ghz Intel 17 (MP-I7). The larger multi-core system 
was an Amazon EC2 c4.8xlarge instance with a 2.9 Ghz Intel Xeon E5-2666v3 (EG2-Xeon). 
Cores: The number of cores in the microprocessor. Mem: The size (GB) of random access 
memory available to the CPUs. 


performed by all processors. The input for each implementation was a pair of 
variables X and Y from a uniform distribution with input sizes that doubled 
from n = 250 to 32,000. The execution of an implementation of the FastBCS* 
algorithm was performed five times for each pair and the average time in mil¬ 
liseconds recorded. No other applications were running on the platform during 
the test. The results are given in [Table 5] and quadratic fits to average runtimes 
depicted in [Figure 10], with relative performance given in [Table 6]. Due to 
memory and time constraints, we were unable to complete the R implementation 
of FastBCS (FB-RPls) for n > 4, 000. 


Context 

250 

500 

1,000 

Vector length 

2,000 4,000 

[n) 

8,000 

16,000 

32,000 

FB-RPls 

441 

3,358 42,189 341,974 1,325,750 

* 

* 

* 

FB-JP2S 

17 

82 

575 

4,685 

34,850 303,481 

2,368,399 19,709,329 

FB2-JP2S 

5 

10 

58 

214 

916 

3,876 

16,867 

82,237 

FB2-JP3S 

3 

10 

33 

129 

579 

2,922 

11,586 

48,431 

FB2-JP2sp 

31 

33 

65 

171 

564 

2,083 

7,929 

36,506 

FB2-JP3sp 

38 

38 

83 

212 

560 

2,167 

6,733 

20,636 


Table 5: A table of the runtimes of different implementations of FastBCS* for pairs of variables 
X and Y of size n = 200 to 2,000 by 200 generated from a uniform distribution. Each cell 
is the average runtime in milliseconds of five executions. Cells marked with an asterisk were 
not completed because of memory and time constraints on the platform used for RStudio. 


The speedups shown in [Table 7] were derived from the runtime values. 
Speedup is defined as S{n) = T*(n)/T(n), where T*(n) is the execution time of 
a reference context and T{n) the execution time of the context being compared 
to it. The use of a sequential implementation as a reference is especially helpful 
in determining the benefit of parallelism (Rauber and Riinger (2013), p. 162). 
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N 

Figure 10: Quadratic fits to average runtimes of FastBCS2. 

The Java implementation of FastBCS (FB-JPls) was used as a reference for 
comparing speedup across all contexts. Three additional tables using different 
references for each comparison are shown in [Table 7] to illustrate the effect on 
speedup by the redesign of the algorithm [Table 7(a)], the number of cores in 
multicore systems [Table 7(b)], and the benefits of the Java 8 parallel stream 
framework [Table 7(c)]. 

The doubling ratio was generated using an implementation based on the 
algorithm shown in [Algorithm 6]. It provides an empirical measurement for 
approximating most common orders of growth. The doubling hypothesis is that 
as the limit of the ratio approaches ~2^, the running time is approximately an^ 
and this is an acceptable approximation of 0{n^) when the ratio is being used 
to predict performance (Sedgewick and Wayne (2011), p. 193). The results of 
the doubling ratio across all platforms are shown in [Table 8]. 

3.3.5. Discussion 

The first two rows of [Tables 5 and 6] show dramatic improvements in 
speedup in large n (n = 32, 000) from just two changes: 1) the use of a static 
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Context 

250 

500 

Vector length (n) 
1,000 2,000 4,000 8,000 

16,000 

32,000 

FB-RPls 

0.038 

0.024 

0.013 

0.012 

0.009 


* 

* 

FB-JP2S 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP2S 

3.2 

8.2 

9.9 

21.9 

38.0 

78.3 

140.4 

239.7 

FB2-JP3S 

5.3 

7.9 

17.4 

36.2 

60.2 

103.9 

204.4 

407.0 

FB2-JP2sp 

0.5 

2.5 

8.9 

27.4 

61.8 

145.7 

298.7 

539.9 

FB2-JP3sp 

0.4 

2.1 

6.9 

22.1 

62.3 

140.1 

351.8 

955.1 


Table 6: Speedup comparing all contexts using FB-JP2s as the reference implementation. 
Cells marked with an asterisk could not be computed because the runtime values were not 
generated. 






Vector length (n) 



Context 

250 

500 

1,000 

2,000 

4,000 

8,000 

16,000 

32,000 

(a) FB-JP2S 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP2S 

3.2 

8.2 

9.9 

21.9 

38.0 

78.3 

140.4 

239.7 

(b) FB2-JP2S 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP3S 

1.6 

1.0 

1.8 

1.7 

1.6 

1.3 

1.5 

1.7 

FB2-JP2sp 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP3sp 

0.8 

0.9 

0.8 

0.8 

1.0 

1.0 

1.2 

1.8 

(c) FB2-JP2S 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP2sp 

0.2 

0.3 

0.9 

1.3 

1.6 

1.9 

2.1 

2.3 

FB2-JP3S 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

FB2-JP3sp 

0.1 

0.3 

0.4 

0.6 

1.0 

1.3 

1.7 

2.3 


Table 7: A table of speedup comparisons using different references to show: (a) the effect 
on speedup of the redesigned algorithm FastBCS2 compared with FastBCS; (b) the effect on 
speedup of a 16-core system as compared with a 4-core system for sequential or sequential- 
parallel execution as well as factors such as processor clock speeds, hardware architecture, 
and cache sizes and speeds; and (c) the effect on speedup of sequential execution as compared 
with sequential-parallel execution on the same platform for two different platforms. 


language and its development and runtime environments (Java, SDK, and JVM) 
over a dynamic language (R and RStudio) and 2) minor changes to the algo¬ 
rithm’s design based on a frequency analysis and an empirical study of the 
algorithm’s behavior resulting from a randomization of the inputs. Comparing 
the doubling ratios of FB-RPls with FB-JP2s and FB-JP2s with FB2-JP2s at 
n = 2,000 and 4,000, shows that most of the improvement to the order of growth 
comes from the algorithm’s redesign. 

Further improvements come from parallelization through the use of the Java 
8 parallel streams for larger n. We see this by comparing FB2-JP2s with FB2- 
JP2sp and FB2-JP3s with FB2-JP3sp. In [Tables 7c and 8] speedup and the 
order of growth begin to improve on the 4-core processor at n = 2, 000, and 
on the 16-core processor at n = 8,000. More powerful hardware improves 
the speedup for a sequential implementation (rows FB2-JP2s and FB2-JP3s of 
[Table 7b]) but has no apparent effect on the order of growth (rows FB2-JP2s 
and FB2-JP3s for n > 2,000 in [Table 8]). 
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Algorithm 6 CalculateDoublingRatios(n^, ne) 

Input: ua and ub are the lower and upper limits of n. 

1: ITERATIONS ^ 5 
2: i ^ 1 
3: n ^ UA 

4: fnc ^ A function that invokes an implementation of FastBCS* 

5: T, ^ timeFastBCS{fnc,n, ITERATIONS) 

6: while n < Ub do 
7: i ^ i + 1 

8: n ^ 2n 

9: Ti ^r- timeFastBCS{ fnc,n, ITERATIONS) 

10: ratios[i\ ^ Ti/R-i 

11: if ratios\i\ approaches a limit then 

12: break 

13: return ratios 

14: function TiMEFASTBCS(/nc, n, ITERATIONS) 

15: t ^ 0 

16: for i 1 ... iterations do 

17: Generate variables X and Y of size n from a uniform distribution. 

18: tt + runtime{fnc{X,Y)) 

19: return t/iterations 


Since the FastBCS* algorithms operate mostly on homogeneous numeric 
data and a significant proportion of execution time in FastBCS* implementa¬ 
tions is spent on the calculation of concordance matrix column sums, manycore 
systems theoretically could further reduce the order of growth by mapping a 
computational kernel onto the data of a column or group of columns, allow¬ 
ing parallel calculation of each column’s sum. In our initial work with GPUs, 
however, we found that data transfer cost was a limitation with the current gen¬ 
eration of GPUs. When the GPU does not share memory with the host GPU, 
data transfers are required to maintain a coherent state between the GPU and 
GPU memories. The data transfer costs overwhelm any gain from parallelism. 
The next generation of heterogeneous computing environments will eliminate 
the need for data transfers through the inclusion of unified memory among de¬ 
vices. 

Standards for heterogeneous computing environments are emerging and the 
major object-oriented languages such as Java as well as newer languages such as 
Dart or Swift have the language features needed to express data-parallelism. Al¬ 
though advancements in heterogeneous computing environments make parallel 
programming increasingly attractive for algorithms like TKTP, the tools for de¬ 
veloping sequential-parallel implementation to run heterogeneously have not yet 
reached the maturity needed to allow these methods to be more broadly applied. 
We expect this to change within the next couple of years. Until then, implemen- 
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Vector length (n) 


Gontext 

250 

500 

1,000 

2,000 

4,000 

8,000 

16,000 

32,000 

FB-RPls 

NA 

2.9 

3.7 

3.1 

3.2 

* 

* 


FB-JP2S 

NA 

2.3 

2.8 

3.0 

2.9 

3.1 

3.0 

3.1 

FB2-JP2S 

NA 

0.9 

2.5 

1.9 

2.1 

2.1 

2.1 

2.3 

FB2-JP3S 

NA 

1.7 

1.7 

2.0 

2.2 

2.3 

2.0 

2.1 

FB2-JP2sp 

NA 

0.1 

1.0 

1.4 

1.7 

1.9 

1.9 

2.2 

FB2-JP3sp 

NA 

0.0 

1.1 

1.4 

1.4 

2.0 

1.6 

1.6 


Table 8: The binary logarithm of the doubling ratio, or lg{T(2n)/T{n)), where T is the 
execution time of an implementation of FastBCS* for X and Y from a uniform distribution. 
The values of the cells represent the power b = lg{ratio) which is an acceptable approximation 
of the order of growth 0{n^) when used to predict performance using the doubling hypothesis. 
Cells marked with an asterisk could not be computed because the runtime values were not 
generated. 


tations for sequential-parallel algorithms written in high level languages can be 
designed and executed on current generation multicore CPUs. 

4. Power and Accuracy Simulation Study 

In this section, simulation studies are used to exemplify the performance of 
TKTP in screening for sample points that come from a population in which the 
variables are associated. Since the method is based only on the relative rankings 
of values observed for each variable, it is invariant to scale transformations of 
the variables, and thus it is sufficient to limit the investigation of properties 
to copulae, which have uniform margins. Samples are simulated under three 
different types of mixing with the independent copula: 1) Frank, 2) Gaussian, 
and 3) a positively associated Frank copula and a negatively associated Gaus¬ 
sian copula. The objectives are to identify a suitable measure of performance of 
TKTP in screening for subsamples that support monotonic association of the 
variables; to check the robustness of this measure under different distributions 
with comparable association; and to provide guidelines for setting the opera¬ 
tional values of a and w to meet coverage requirements. Toward these aims, 
it is helpful to provide some pertinent background on the Frank and Gaussian 
copula families. For a thorough introduction, see Nelsen (2006). 

Although the Frank copulae are often indexed by r and the Gaussian copulae 
indexed by the population version p of Spearman’s correlation coefficient, which 
is also the product moment correlation of variables X and Y after their indi¬ 
vidual cumulative distribution function (GDF) transforms, [Figure 11] shows 
how either parameter may be used to index either family of copulae. In par¬ 
ticular the Frank copulae with r = 0.3,0.5 or 0.7 have corresponding values of 
p = 0.45, 0.7 or 0.89, respectively. 

Each line in [Figure 11] was formed from 100 summary points, with each 
point based on a sample of 100,000 points from one of the copulae. For each 
sample generated from a Frank copula with a given r, a Spearman correlation 
coefficient p was calculated, which essentially equals the population p since the 
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Figure 11: Corresponding population values of Kendall’s r and Spearman’s p from Frank and 
Gaussian copulae. 


sample size is so large. Similarly, samples from the Gaussian copulae were gen¬ 
erated with fixed values of p and the corresponding population r was estimated 
from these large samples. Note that for any copula the population value of 
Spearman’s p is the same as the population (Pearson) product moment corre¬ 
lation between the uniformly distributed marginal variables. 

In the figure, the solid black line is based on points generated from Frank 
copulae, and the solid red line is based on points generated from Gaussian 
copulae. Dashed lines are inversions of the solid ones, and colors indicate either 
p versus r (black) or r versus p (red). Although there exist distributions in 
which the relationship between population r and p may be quite different from 
that given above, this relationship remains quite stable across the Frank and 
Gaussian copulae, as seen from the slight difference between solid and dashed 
lines of the same color. 

Although a Frank and a Gaussian copula may share the same values of r and 
p, the distributions themselves are generally quite distinct. Density contours of 
the Frank and Gaussian copulae, both having t = 0.5 and p = 0.7, are depicted 
in [Figure 12]. 

Now consider the performance of TKTP when applied first to various mix¬ 
tures of Frank and independent copulae. This first simulation study follows 
a 3 X 3 X 2 complete block design, with three levels of sample sizes (n = 
100,500,1000), three levels of association strength in the subsample (r = 0.3, 
0.5, 0.7) and two levels of mixing proportion of the associated subsample {p = 
0.3,0.4). For each of the 18 combinations of these parameters, 100,000 samples 
are generated from the Frank mixture, and the results of the (a = 0.05; w = 3) 
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Figure 12: Density contours of Frank (left) and Gaussian (right) copulae, both with r = 0.5 
and p = 0.7. Contours depict density levels of 0.5 to 5 in steps of 0.05. 


TKTP screen recorded. [Figure 13] summarizes the major finding. 

In [Figure 13], each point represents the paired means from 100,000 simula¬ 
tions. “Rate” refers to the increase in percent coverage with increase in percent 
of sample selected. The plot character represents sample size (circle n = 100; 
triangle n = 500; plus n = 1,000). Color represents strength of association 
(black —)■ r = 0.3; red —i r = 0.5; green —i r = 0.7). The size of each symbol 
represents the proportion of associated subsample in the mix (small —> 30% of 
observed pairs are associated; large —i 40%). 

The stopping point is the point at which the algorithm determines the pre¬ 
dictive power of the remaining sample reduces to chance levels. Percent covered 
is the percent of observations from the Frank copula that appear before the 
stopping point and are thus included in the screen. 

Increasing sample size from n = 100 to n = 500 has a big effect on both 
stopping point and percent coverage, but differences between these variables 
at n = 500 and n = 1, 000 are small. The biggest effect of sample size is on 
the distributions of these two variables, which are highly skewed left for small 
sample sizes, but approach normality for larger sample sizes. 

A larger underlying proportion (0.4 versus 0.3) always increases the mean 
stopping proportion of the range for any sample size and any r, but does not 
necessarily increase mean coverage probability. The method has little power to 
detect proportions less than 2-y/n—1.758n^/®, which is the expected length of the 
longest increasing subsequence under independence (Baik et al. (1999), Aldous 
and Diaconis (1999)), and it is also difficult for the algorithm to distinguish a 
large associated subpopulation from the whole population. In the limited range 
investigated here, size of the underlying associated subpopulation does not have 
a big effect on the algorithm. 

On the other hand, strength of association in the subpopulation has a sub¬ 
stantial effect on the performance of TKTP. Stronger association (increasing r) 
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Figure 13: Summary of TKTP {a = 0.05, w = 3) simulations under Frank mixtures of copulae. 


produces a substantial increase in both mean stopping point and mean coverage 
for any fixed sample size, which is captured by the rate of coverage, as given by 
the slopes of the lines in [Figure 13]. Formally, under any given copula. 


Rate of coverage 


Mean(Number of associated values selected) 
Mean (Stopping point) 


( 3 ) 


which clearly does not depend on the sample size. [Table 9] gives the rates 
of coverage under Frank and Gaussian copulae that are matched for the same 
strength of association in the subsample. 


r 

Frank Rate 

Gaussian Rate 

P 

0.3 

1.19 

1.17 

0.45 

0.5 

1.33 

1.30 

0.70 

0.7 

1.49 

1.46 

0.89 


Table 9: Rates of Coverage [Equation (3)] under Frank and Gaussian copulas. 


The simulations compare coverage rates for TKTP for a in {0.1,0.05,0.10} 
to see how stable the coverage probability is in this range. It is enough to 
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consider just Frank copulae with n = 500, mixing proportion = 0.3, and r in 
{0.3,0.5, 0.7}. The standard errors of these rates are all about 0.1, suggesting 
little real difference in the TKTP performance; the comparable Gaussian version 
of [Figure 13] is virtually indistinguishable from that figure, and is thus omitted. 
Nevertheless, the actual coverage is systematically less in all 18 Gaussian copulae 
under study than in their comparable Frank copulae, suggesting that the small 
differences indicate a real but slight underperformance of TKTP in the Gaussian 
setting. 

In all cases, the rate of coverage seems to capture the effectiveness of the 
method. From the table, it remains stable under different types of mixtures; and, 
from [Figure 13], it is not affected by small changes in the mixing proportion. 

The control parameters a and w of the TKTP have different tasks. The 
value a changes the distribution of the stopping point, with smaller values of 
a inducing stochastically larger distributions of the stopping point. The effect 
is much the same as selecting an operating point in a classification method. 
The value a = 0.05 gives a rough midrange value, and lowering it slightly will 
increase both the stopping point and the coverage probability, but lowering it 
too far will greatly increase the probability of wrongly including uncorrelated 
samples in the selection set. For large sample sizes n > 500, we recommend 
a = 0.05 because these quantiles are well estimated in the simulations and the 
stopping point does not change much for other values of a near 0.05. The control 
parameter w helps to stabilize the performance of TKTP. It is useful mostly for 
small sample sizes n > 100, as studied here; we recommend w = 3, the smallest 
practical smoothing. 

5. Application: Long Term Predictive Power of Oil for S&:P 500 

Stocks 

As an example of how TKTP may be used in large datasets, the weekly 
closing prices of stocks currently listed in the Standard & Poor’s 500 index and 
having a 10-year history are correlated against the corresponding price of oil 
six months prior. By eliminating weeks where the overall pattern is broken, 
the TKTP method provides not only a robust estimate of correlation, but also 
the set of common time points over which predictions are most reliable for a 
key cluster of stocks. For most of the stocks in this cluster, oil is not a direct 
cause of the performance of these stocks, but simply a reliable indicator of the 
likely course of stock prices, at least during the time periods ascertained by the 
TKTP. The goal here is not to predict actual stock prices, but to identify likely 
periods of a sustained trend. 
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Figure 14: Cumulative number of stocks associated with 6-month lagged oil. 


The dataset is composed of weekly prices of 26-week lagged S&P 500 stocks 
versus oil over a decade from 12/31/2004 to 1/2/2015 (523 time points, 449 
stocks with 10 years of data). The initial screen follows a two-step selection 
process: 

Step 1. Find stocks with partial association over the most weeks. There 
were 523 — 26 = 497 time point pairs (oil time, stock time) available. The first 
screen was to select stocks associated over at least 60% of these time point pairs. 
There were 273 such stocks correlated with 6-month oil over periods of at least 
6 years. See [Figure 14] which suggests this cutoff. 

Step 2. Find the correlations (both Pearson and Kendall) between oil and 
lagged stock prices over the TKTP-selected time point pairs. There are 77 of 
these with Pearson correlations over 0.9, listed in [Table 10]. 

[Figure 15] depicts the full 10-year time courses of the weekly price of the 
five stocks with the largest Pearson correlations in [Table 10] relative to the 
6-month prior price of oil. These five stocks come from different industries: 
PRGO-pharmaceuticals, CRM-software, ARG-chemicals, CTSH-IT services, 
SRCL-Commercial Services. This suggests that oil price predicts, at least in 
pattern, an aspect of the general economy. Although the huge crash of oil in 
2008 was disproportionate, it did predict a downturn in each of the stocks six 
months later. 

More importantly than general association, the TKTP method helps to iden- 
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FMC 

0.901 

COL 

0.902 

TJX 

0.902 

CNP 

0.902 

BEN 

0.902 

XOM 

0.902 

SIAL 

0.902 

PH 

0.902 

NBL 

0.903 

XRAY 

0.903 

EMC 

0.905 

ISRG 

0.905 

APH 

0.905 

PCAR 

0.906 

DHR 

0.906 

RRC 

0.907 

FTI 

0.907 

FAST 

0.907 

ROK 

0.908 

DRI 

0.908 

EL 

0.908 

RAI 

0.908 

PWR 

0.909 

SPG 

0.909 

PCLN 

0.910 

LH 

0.910 

CL 

0.910 

AMZN 

0.911 

HCP 

0.911 

UTX 

0.912 

ROST 

0.913 

BLK 

0.913 

IBM 

0.913 

EW 

0.914 

SO 

0.914 

GWW 

0.916 

KO 

0.916 

PGP 

0.916 

GIS 

0.916 

ALTR 

0.916 

BBBY 

0.917 

CERN 

0.917 

INTU 

0.917 

DTV 

0.917 

FLS 

0.918 

MOD 

0.918 

XEL 

0.918 

WAT 

0.918 

CTXS 

0.918 

SWK 

0.919 

WEC 

0.919 

ES 

0.920 

AZO 

0.920 

CSX 

0.921 

CMI 

0.922 

FDO 

0.922 

PSA 

0.922 

DLTR 

0.923 

ORCL 

0.923 

YUM 

0.924 

FFIV 

0.924 

RL 

0.924 

ACN 

0.926 

ED 

0.926 

0 

0.926 

EMR 

0.928 

VTR 

0.928 

PX 

0.928 

ESRX 

0.930 

AAPL 

0.930 

RHT 

0.932 

VAR 

0.932 

SRCL 

0.933 

CTSH 

0.934 

ARC 

0.935 

CRM 

0.936 

PRGO 

0.942 


Table 10: Pearson correlations for 77 S&P 500 stocks correlated with 6-month prior oil price 
over different restricted periods of at least 6-year duration. 


tify the time periods over which association is strongest. Pooling information 
from several stocks strengthens the degree of confidence about the time periods 
selected, but the pooling needs to be done methodically. The idea is to pool 
information from stocks that agree strongly with each other. 

A complete-linkage clustering approach is used to find the stocks from which 
to pool time-point inclusions. The agreement measure used for pairs of time- 
point sets is the Jaccard/Tanimoto coefficient J defined for sets A and B as 

j \A^B\ 

\AUB[ 

A cluster of 24 of the 273 Step 1-screened stocks has all pairs with J > 0.8. 
These 24 stocks, listed in [Table 11], include three of the five stocks from [Figure 
15], and 13 of the stocks from [Table 10]. 


CAN 

ARC 

ABC 

APH 

AZO 

BLL 

HSIC 

DTV 

ECL 

EL 

FIS 

FISV 

GIS 

HIL 

REGN 

ROP 

CRM 

SIAL 

SRCL 

TJX 

UNP 

UHS 

WAT 

WEC 


Table 11: Cluster of stocks completely linked by J > 0.8. 
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Figure 15: Five stock prices associated with 6-month lagged oil: 2005-2014. 


The average 24 TKTP correlations with 6-month oil were 0.89 (Pearson) 
and 0.75 (Kendall). [Figure 16] illustrates, for each week, the number of stocks 
having that week included by TKTP. The color coding follows the rainbow with 
red indicating inclusion in all 24 stocks; yellow-green, in 13 stocks; and blue in 
0 stocks. When plotted under the price of oil, it appears that weeks tend to be 
excluded during the most volatile periods for oil prices. 

6. Discussion 

The computational efficiencies attained by the new algorithms enable a whole 
new approach to discovering key subpopulations from big data. Many pairs of 
variables may be screened to see if there is a common subset of observations that 
supports strong association between the pairs, and thus represents a subpopu¬ 
lation supporting a whole network. This could be a subpopulation of cancers 
supporting a gene-network for chemoresistance, or a subpopulation of time pe¬ 
riods supporting an economic network for growth. 

Although the methods here have been presented in the classical view of ana¬ 
lyzing a random sample from a population, they also apply to any large dataset. 
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week 


Figure 16: TKTP-selected weeks for oil price prediction of stock cluster prices. 


Inference here is internal to the dataset treated as its own population. Identified 
subsets correspond to a real subpopulation, but identifying the subpopulation 
externally may not be straightforward. For the stock price example, the apho¬ 
rism, “A rising tide lifts all boats,” seems to apply. A foretelling rise of oil 
prices, ignoring brief periods of large volatility, signals a rise in tide, at least 
in the harbor where the cluster of two dozen stocks is moored. It is not clear 
which stocks will remain in the harbor, or what other, more specific, influences 
will be in force. 

The decision to focus on ranks rather than on the original numerical values 
was motivated both by concerns for inferential robustness and insights from 
recent mathematics. When moving between the database and external envi¬ 
ronments, calibration issues often arise, but causal relationships persist in situ. 
Thus rank associations found within a database subset are very likely to ap¬ 
pear in corresponding external subpopulations. Among rank-based methods, 
the multistage ranking model is very flexible since the number of parameters 
increases with sample size. 

Recent mathematical methods motivate the use of the multistage ranking 
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method. In the special case where all the population parameters {0j} are equal, 
yielding the Mallows model, Starr (2009) has found a limiting distribution which, 
under proper scaling, is a Frank copula. More general parameterizations of the 
multistage model appear to converge to mixtures of Mallows, which covers a very 
wide range of ranking distributions. If this is true, the multistage model for a 
pair {X, Y) of variables ranking the n observations will converge to a mixture of 
Frank copulae. A “signal” in such a mixture of might correspond to a mixture 
of components with large values of Kendalls tau, with “noise,” if present, being 
represented by a mixture of nearly independent Frank copulae. Thus, in this 
conceptual setting, TKTP screens out a nearly independent component. Doing 
this on the basis of a single ranking is remarkable; finding additional pairs of 
variables with the same limiting distribution, as in the example of Section 5, 
adds to the accuracy. See Awasthi et al. (2014) and Meila and Chen (2012) for 
general approaches to mixtures of Mallows models. 
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